Transforming the Web into Data (with Python)

Agenda

  • conceptual introduction to web scraping
  • tools for non-programmers
  • tools for python programmers
  • code tour
  • break
  • scrape from scratch exercise

why scrape the web

  • there is a lot of human activity on the web, which produces
  • new and unique data/traces, that can lead to
  • insight & understanding for data science, the social sciences, and the humanities.
  • Ground Truthiness - remember the web is only a particular representation of human behavior
  • You can also scrape for fun & profit 💰

so what is scraping the web?

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. - Wikipedia

conceptual introduction to web scraping

  • there are roughly three steps - results may vary (a quick sketch follows below)
    1. fetching resources - asking a computer "hey, can you send me http://google.com?"
    2. parsing documents - creating a machine readable representation of a web page
    3. extracting data - pulling out just the information of interest
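
Here is a minimal sketch of all three steps in python. It assumes the Beautiful Soup library (which shows up again in the tools section) plus the requests library, which is one common choice for fetching; the URL is just an example.

import requests
from bs4 import BeautifulSoup

# 1. fetching: ask a server to send us a page
response = requests.get("http://pitt.edu")

# 2. parsing: turn the raw HTML text into a tree of elements
soup = BeautifulSoup(response.text, "html.parser")

# 3. extracting: pull out just the information of interest
print(soup.title.string)   # e.g. "Home | University of Pittsburgh"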

fetching resources

  • Hyper Text Transfer Protocol (HTTP)
    • fundamentally about requests & responses
    • the language of the web
    • the main request methods: GET, POST, PUT, DELETE (plus a few others)
    • URLs point to resources
  • verbs & nouns
    • request methods are the verbs
    • resources are the nouns
    • URLs are the proper nouns
  • stateless
    • doesn't have a good memory
    • sessions - how HTTP servers remember "state"
    • cookies - the token passed in HTTP requests & responses
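
A quick request & response sketch, again using the requests library as one possible HTTP client and http://pitt.edu as a stand-in URL:

import requests

# GET is the verb, http://pitt.edu is the proper noun
response = requests.get("http://pitt.edu")

print(response.status_code)               # 200 means "OK"
print(response.headers["Content-Type"])   # what kind of resource came back
print(response.cookies)                   # any cookies the server handed us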

fetching resources

  • web pages - made for humans
  • APIs - made for machines
    • Application Programming Interface - fancy name for how computers talk to each other
    • how to get data from the social web (i.e. Twitter, Facebook, etc.)
    • related, but distinct from web scraping (more structured, access control)
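
For comparison, here is a sketch of asking an API for data instead of scraping a page. GitHub's public API is used only because it needs no login; the field names are whatever that particular API returns.

import requests

# APIs hand back structured data (usually JSON) instead of HTML
response = requests.get("https://api.github.com/users/octocat")
profile = response.json()   # parse the JSON body into a python dict

print(profile["login"], profile["public_repos"])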

parsing documents

  • HTML documents are composed of elements or tags
    • the <html> tag is the root of the tree
  • the HTML specification defines a bunch of tags
    • <p>this is a paragraph tag with text <em>inside</em> of it</p>
    • <a href="http://pitt.edu">This is an anchor tag, basically a link</a>
    • not enough time to review all of them
  • parsing transforms the barf into a tree of elements
<!DOCTYPE html>
<html>
  <head>
    <title>A basic webpage</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <ul>
      <li>First item in an unordered list</li>
      <li>Second item in an unordered list</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph separated by a div element.</p>
    </div>
  </body>
</html>
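
A sketch of parsing that snippet with Beautiful Soup (covered in the tools section) and poking around the resulting tree:

from bs4 import BeautifulSoup

html = """
<html>
  <head><title>A basic webpage</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <ul>
      <li>First item in an unordered list</li>
      <li>Second item in an unordered list</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph separated by a div element.</p>
    </div>
  </body>
</html>
"""

# parse the raw text into a tree of elements
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                          # A basic webpage
print([li.text for li in soup.find_all("li")])    # both list items
print(soup.find("div", class_="stuff").p.text)    # the paragraph inside the div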

In [8]:
! curl -s http://pitt.edu | head -n30


<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#"><!--<![endif]-->

<head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8" />
<link rel="shortcut icon" href="http://www.pitt.edu/sites/default/files/pitt_favicon_0.ico" type="image/vnd.microsoft.icon" />
<link rel="shortlink" href="/node/62" />
<link rel="canonical" href="/home" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
  <title>Home | University of Pittsburgh</title>
  <meta name="description" content="The University of Pittsburgh is among the nation's most distinguished comprehensive universities, with a wide variety of high-quality programs in both the arts and sciences and professional fields." />
  <meta name="Keywords" content="University, Pittsburgh, Pitt, College, Learning, Research, Students, Undergraduate, Graduate" />
    
      <meta name="MobileOptimized" content="width">
    <meta name="HandheldFriendly" content="true">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="cleartype" content="on">
	
  <link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_kShW4RPmRstZ3SpIC-ZvVGNFVAi0WEMuCnI0ZkYIaFw.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_vZ_wrMQ9Og-YPPxa1q4us3N7DsZMJa-14jShHgRoRNo.css" media="screen" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_4mmZo2I5oU53mjQh0UjgygKazedTCqZXNvrxFyYrT-g.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_hME6weH8liYUm6qr-IDiSXVwXgjKndoDaEQ2Jq3-W10.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_LwvUaww9zeUDxZ1r2K4dHcSbAEEzbSNA-5Zz2KIgwD4.css" media="all" />
   
  <script src="http://www.pitt.edu/sites/all/modules/jquery_update/replace/jquery/1.5/jquery.js?v=1.5.2"></script>
<script src="http://www.pitt.edu/misc/jquery.once.js?v=1.2"></script>

extracting data

  • ok, now we are going to get really technical
  • pull information out of the tree and push it somewhere else
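
For example, a sketch that pulls every link out of a page's tree and pushes it somewhere else, here a CSV file; the URL and filename are just placeholders.

import csv

import requests
from bs4 import BeautifulSoup

# fetch & parse a page, then push its links into a CSV file
soup = BeautifulSoup(requests.get("http://pitt.edu").text, "html.parser")

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    for a in soup.find_all("a", href=True):
        writer.writerow([a.get_text(strip=True), a["href"]])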

extracting data

  • how?
    • copy & paste
    • automated scripts
  • if you have a lot of data, copy & paste probably won't work for you
  • if the data are on multiple pages, you will need to crawl with a spider
  • web crawlers extract the links from a web page, fetch those pages, extract links, fetch, extract, fetch...
  • scripts and tools help automate this process

extracting data

  • first step: where are the data in the HTML tree?
  • right-click & select "Inspect Element" - works in Firefox & Chrome, and in Safari if the Develop menu is enabled
    • THE MATRIX

extracting data

  • selection is the key
  • many different ways to select HTML tags
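
A few of those ways, sketched with Beautiful Soup on a tiny made-up snippet:

from bs4 import BeautifulSoup

html = '<div class="stuff"><p id="intro">Hello</p><a href="http://pitt.edu">Pitt</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p").text)                       # select by tag name
print(soup.find(id="intro").text)                # select by attribute
print(soup.select_one("div.stuff a")["href"])    # select with a CSS selector
print([a["href"] for a in soup.find_all("a")])   # grab every link on the page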

basic workflow

  1. fetch pages
  2. extract data
  3. extract links
  4. fetch more pages
  5. ...
  6. profit?
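
A toy spider that follows this workflow. It is only a sketch: it visits a handful of pages, sleeps between requests, and the start URL is just an example.

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=5):
    """Toy spider: fetch a page, extract data, extract links, fetch more pages..."""
    to_visit, seen, titles = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)

        # fetch the page and extract data (here, just the <title>)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        if soup.title:
            titles.append(soup.title.string)

        # extract links and queue them up so we can fetch more pages
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http"):
                to_visit.append(link)

        time.sleep(1)   # be polite to the server
    return titles

print(crawl("http://pitt.edu"))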

DATA CLEANING!

challenges in web scraping

  • logins, paywalls, and access control
    • these are not impossible - most scraping tools support HTTP sessions & cookies
    • throttling - there's a fine line between scraping & a denial-of-service attack
    • THE LAW - read the terms of service, copyright? FAIR USE!
  • dynamic websites
    • JavaScript - hard to scrape because the DOM changes after the page loads
    • AJAX or XMLHttpRequest - pages can asynchronously fetch data & update themselves
  • the document vs. application centric web
    • scraping Gmail?
    • APIs help, if they exist
  • mobile web / apps????
    • ¯\_(ツ)_/¯
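
A sketch of dealing with logins & throttling using a requests Session. The login URL and form field names here are completely made up - they depend entirely on the site you are scraping.

import time
import requests

# a Session keeps cookies between requests - that's how you stay "logged in"
session = requests.Session()

# hypothetical login form: the URL & field names depend entirely on the site
session.post("https://example.com/login",
             data={"username": "me", "password": "secret"})

# the session now sends the login cookie along automatically
for page in ["https://example.com/private/1", "https://example.com/private/2"]:
    response = session.get(page)
    print(response.status_code)
    time.sleep(2)   # throttle yourself - stay on the right side of the scraping/DoS line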

tools for non-programmers

  • Import.io - a commercial product, not sure how expensive
  • Wget - the Swiss army knife of web scraping tools, command line
  • HTTrack - a Windows tool for copying websites, GUI
  • ScraperWiki - a service that costs money, if you have a grant...
  • Scraper Plugin - a Chrome plugin instead of a service, looks pretty easy to use
  • Diffbot - more advanced extraction, really nice guys, costs money but they support research if you ask them
  • EMAIL - sometimes it doesn't hurt to ask!

tools for python programmers

fetching

parsing & extracting

  • Beautiful Soup - the most popular library
  • Soupy - a wrapper around BS to make life easier
  • Scrapely - another tool for extracting structured data from web pages
  • lxml - a bit lower level, supports XPath, which I prefer (quick sketch below)
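
For example, a quick XPath sketch with lxml, run on the div from the earlier snippet:

from lxml import html

page = html.fromstring("""
<div class="stuff">
  <p>Another paragraph separated by a div element.</p>
  <a href="http://pitt.edu">Pitt</a>
</div>
""")

# XPath expressions pick nodes out by their place & attributes in the tree
print(page.xpath('//div[@class="stuff"]/p/text()'))
print(page.xpath('//a/@href'))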

data management

Where to go next?

Learning Python

Web Scraping

  • The Ultimate Guide to Web Scraping - A short book that provides a conceptual introduction to web scraping
  • Mining the Social Web, 2nd Edition - An excellent book for more advanced programmers who are interested in collecting and analyzing data from social websites like Twitter, Facebook, and Github.
  • Web Scraping with Python - A new book coming summer of 2015 that appears to cover the more technical aspects of scraping the web with python.
  • Google - Again, seriously, there are a million tutorials on the web. Some are more technical than others.